Introduction

Advances in single-cell biology has enabled us to investigate isolated single cells at an unprecedented resolution. Current methods can measure subtle changes in cell state, revealing specialized minor subpopulations within a cell type. However, functional characterization of such subpopulations is still challenging. A promising approach to untangling functional characterization is spatial transcriptomics, where genomewide measurements of gene expression are enriched with spatial attributes. Such approaches allow us to determine the relative positions of cell subpopulations. Unfortunately, current spatial transcriptomic approaches suffer either from problems with sensitivity, are limited to a restricted panel of genes, or the spatial attributes are of low resolution.
In order to deconvolute the cell connectome, we sort multiplets of incompletely dissociated cells along with single cells from the same sample. The single cells provide a set of blueprints of possible subcell types, and subsequently we use swarm optimization approaches to infer the composition of multiplets. Using this method we can measure the composition of a very large number of multiplets onto a lattice consisting of subcell types of arbitrary complexity.

Package summary

To accomplish these end-points an software package based on the R programming language has been deveolped named “spatial single cell RNA seq” or sp.scRNAseq. The package works in 3 seperate stages which are described below:

  1. A spCounts object is made by providing singlet and multuplet exprssion data.
  2. The number of cell types present in the tissue is determined by running tSNE on the singlet expression profiles.
  3. Swarm optimization is then used to deconvolute the multuplets, providing a map of cell-cell interactions.

Below we will demonstrate the sp.scRNAseq packages functionality and, in parallel, describe and show results from the testing of the package on in silico and biological data.

Synthetic data testing of multuplet deconvolution using swarm optimization

To begin we use in silico data to evaluate if swarm optimization can be used as a deconvolution method for scRNAseq multuplets. We utilize random data generated from the negative binominal distribution to simulate scRNAseq singlets (single cells) with the .syntheticSinglets function. This function generates a data set of 100 singlet expression profiles with 2000 genes each and 10 different cell types. These cell types were combined into multuplets consisting of 2, 3, or 4 cells using the .syntheticMultuplets function. This dataset was then used used for testing the deconvolution process.

#load spatial scRNAseq testing package
library(sp.scRNAseqTesting)

Singlets tSNE

After the data was generated, tSNE was run to view the clusering of the singlets and classify them into cell types. The results from the tSNE, plotted below, indicate that each cell clusters well within it’s cell type and that no overlap exists between groups of cell types. This represents the “ideal” situation and gives us a easy starting point to understand if swarm optimization may be an appropriate method to use even with less “well behaved” datasets.

#load synthetic data
data(syntheticData)

#make spCounts object
cObj <- spCounts(counts = syntheticData, counts.ercc=matrix(), sampleType = "m.")


#run unsupervised learning
uObj <- spUnsupervised(cObj, max=1000, max_iter = 1000)



#plot unsupervised results
spPlot(uObj, type = "clusters")

Figure 1. Unsupervised clustering of synthetic data singlets.


In the next, step swarm optimization was then evaluated as a method to deconvolute the multuplets, thus producing a cell-cell interaction map or “connectome”. To test this we generated multuplets based on the results of the post-tSNE classifications of the singlets. For example, to generate a multuplet representing the A1 and B1 group, one cell was picked from each group and the mean for each gene was calculated. Within the synthetic dataset multuplets were generated for all possible combinations of singlets using 2, 3, or 4 cells resulting in a total of 385 multuplets. We then picked 15 of these multuplets according to a specific number of connections desired between cell types (see table below). The swarm optimization was then tested using the spSwarm function . Due to the fact that we know which singlets were used to produce each multuplet and, therefore, which connections we expect, we can then measure the accuracy of the swarm optimization output.

Results synthetic data

#run pySwarm
result <- spSwarm(cObj, uObj, limit=10, maxiter=10, swarmsize=250, cores=5)


syntheticDataTable
from/to A1 B1 C1 D1 E1 F1 G1 H1 I1 J1
A1 1 1 1 0 0 2 0 0 1 0
B1 0 0 1 0 1 3 1 2 2 0
C1 0 0 0 1 3 3 1 1 3 1
D1 0 0 0 0 1 1 2 0 0 1
E1 0 0 0 0 0 1 1 2 2 2
F1 0 0 0 0 0 1 2 1 3 1
G1 0 0 0 0 0 0 0 0 0 1
H1 0 0 0 0 0 0 0 0 0 1
I1 0 0 0 0 0 0 0 0 0 1

Table 1. Expected connections in the synthetic multuplets.

Below we can see the network plot of the resulting connections detected after multuplet deconvloution. The singlets are clustered according to the previous tSNE results (small dots), the mean for each singlet classification group is represented by the large dots, and the detected interactions between groups (lines), and the number of interactions detected, or edge weights, are represented in the line thickness. Close inspection reveils that the detected connections mirror the expected connections in the table above.

#plot results
spPlot(result)

Figure 2. Newtork graph of the deteced multuplets in the synthetic data.

To view the results in a easier manner, the heatmap below shows the real vs detected connections with the number of connections represented as the heatmap’s cell colors. The fact that all of the connections lay on a diagnal through the plot indicates that the swarm optimization method had a 100% sucess rate.

tabels <- calculateConnections(getData(result, "codedSwarm"), type="codedSwarm")[[2]]
real <- data.frame(tabels[,,'real'])
detected <- data.frame(tabels[,,'detected'])
d <- data.frame(
    real = paste(
        real[rep(1:nrow(real), real[["Freq"]]), ]$from, 
        real[rep(1:nrow(real), real[["Freq"]]), ]$to, 
        sep="-"
    ),
    detected = paste(
        detected[rep(1:nrow(detected), detected[["Freq"]]), ]$from, 
        detected[rep(1:nrow(detected), detected[["Freq"]]), ]$to, 
        sep="-"
    )
)
table <- as.data.frame(table(d))

ggplot(table, aes(x=real, y=detected, fill=factor(Freq)))+
    geom_tile(alpha=0.85)+
    theme_few()+
    scale_fill_manual(values = c("lightgrey", "#77B7E5", "#F2F7E8", "#F9BD7E", "#C0343E", "#450b18"))+
    theme(
        axis.text.x=element_text(angle=90),
        legend.position="top"
    )+
    labs(
        x="Real Connections", 
        y="Detected Connections", 
        fill="Number of Connections"
    )

Figure 3. Real vs deteced connections in the synthetic data multuplets.

Expression data testing of multuplet deconvolution using swarm optimization

Expression data tSNE

Next we wanted to test swarm optimization using a dataset more representative of real data. To this end we used 131 fetal pancreas single cells that had been scRNA sequenced for this purpose. Below we can see the tSNE results from these singlets indicating that 8 cell types can be identified.

#load test data
data(expressionTestData)

#make spCounts object
cObj <- spCounts(counts = expressionTestData, counts.ercc=matrix(), sampleType = "m.")


#run unsupervised learning
uObj <- spUnsupervised(cObj, max=2500, max_iter=10^5)



#plot unsupervised results
spPlot(uObj, type = "clusters")

Figure 4. Unsupervised clustering of the expression data singlets.

Expression data results

Next we used the single cell data and combined single cells into multuplets of 2, 3, or 4 cells using the [assembleTestData][] function. This resulted in 162 multuplets from which we then picked 15 to provide a specific expected connectome and ran swarm optimization. We can again start by looking at the table showing the expected connectome and the resulting network plot.

#run pySwarm
result2 <- spSwarm(cObj, uObj, limit=10, maxiter=10, swarmsize=250, cores=5)


expressionTestTable
from/to A1 C1 D1 E1 F1 G1 H1
A1 1 2 2 0 3 2 1
B1 0 2 0 1 3 1 2
C1 0 1 2 1 3 2 4
D1 0 0 0 2 1 0 1
E1 0 0 0 0 2 1 2
F1 0 0 0 0 0 2 3
G1 0 0 0 0 0 0 4

Table 2. Expected connections in the expression data multuplets

#plot results
spPlot(result2)

Figure 5. Network graph of the connections detected in the expression data multuplets.

Plotting the results in the heatmap format indicate that swarm optimization is, again, able to deconvolute the multuplets within this dataset with a 100% sucess rate.

tabels <- calculateConnections(getData(result2, "codedSwarm"), type="codedSwarm")[[2]]
real <- data.frame(tabels[,,'real'])
detected <- data.frame(tabels[,,'detected'])
d <- data.frame(
    real = paste(
        real[rep(1:nrow(real), real[["Freq"]]), ]$from, 
        real[rep(1:nrow(real), real[["Freq"]]), ]$to, 
        sep="-"
    ),
    detected = paste(
        detected[rep(1:nrow(detected), detected[["Freq"]]), ]$from, 
        detected[rep(1:nrow(detected), detected[["Freq"]]), ]$to, 
        sep="-"
    )
)
table <- as.data.frame(table(d))

ggplot(table, aes(x=real, y=detected, fill=factor(Freq)))+
    geom_tile(alpha=0.85)+
    theme_few()+
    scale_fill_manual(values = c("lightgrey", "#77B7E5", "#F2F7E8", "#F9BD7E", "#C0343E", "#450b18"))+
    theme(
        axis.text.x=element_text(angle=90),
        legend.position="top"
    )+
    labs(
        x="Real Connections", 
        y="Detected Connections", 
        fill="Number of Connections"
    )

Figure 6. Real vs detected connections in expression data multuplets.

Experimental data testing of multuplet deconvolution using swarm optimization

Next, we used the sp.scRNAseq package on experimental data where both singlets and multuplets from fetal pancreatic tissue were physically isolated, via FACS sorting, previous to scRNAseq. The dataset includes 131 singlets and 69 multuplets.

Experimental data ercc

scRNAseq was performed with ercc spike-ins where a constant amount of nonhuman control RNA is spiked into each well when the plate is prepared. In the plot below we can see the ercc spike-in levels for singlets and multuplets. Multiplets have on average ~30% the number of ERCC reads, indicating that each contain on average three cells. The fraction depends on various other things, such as the size of the cell and how much of the cell’s RNA that was successfully recovered.

#load test data
data(expData)

#expData is already a spCounts object
cObj <- expData


spPlot(cObj, type="ercc")

Figure 7. Fraction of ERCC spike-ins in experimental data.

Experimental data markers

Next, we wanted to examine known cell identity marker expression within the singlets and multuplets. We expected that, whereas the singlets should exclusivley express these markers, the multuplets would show examples where multiple markers are expressed. Although some markers appear to be more promiscuous than others, we see that many of the markers follow this pattern.

INS THY1

markers <- c("INS", "THY1")
spPlot(cObj, type="markers", markers=markers)

INS GCG

markers <- c("INS", "GCG")
spPlot(cObj, type="markers", markers=markers)

FLT1 THY1

markers <- c("FLT1", "THY1")
spPlot(cObj, type="markers", markers=markers)

EPCAM THY1

markers <- c("EPCAM", "THY1")
spPlot(cObj, type="markers", markers=markers)

PROM1 THY1

markers <- c("PROM1", "THY1")
spPlot(cObj, type="markers", markers=markers)

PROM1 INS

markers <- c("PROM1", "INS")
spPlot(cObj, type="markers", markers=markers)

Figure 8. Cellular marker expression in experimental data singlets and multuplets.

Experimental data tSNE

Next, we ran tSNE and plotted the results. This reveals the presence of 9 distinct cell types within the tissue.

#run unsupervised learning
uObj <- spUnsupervised(cObj, max=2000, max_iter=20000)



#plot unsupervised results; cluster plot
spPlot(uObj, type="clusters")

Figure 9. Unsupervised clustering of the experimental singlets.

To relate the resulting tSNE-based classifications to previously known cell identities we further plotted the tSNE results showing expression levels (log 2 normalized counts) of several well known cell surface markers. Note that the dot colors in the plot represent the expression levels.

#plot unsupervised results; markers plot
markers <- c("EPCAM", "THY1", "FLT1", "INS", "GCG", "NEUROG3", "PROM1", "SST", "PRSS1")
spPlot(uObj, type="markers", markers=markers)

Figure 10. Cellular marker expression in post-tSNE classification groups.

Experimental data results

Finally, we used the sp.scRNAseq package to analyze the cell-cell interactions within the multuplets. The results indicate the presence of a high number of interactions between cell types B1 and A1, as well as, B1 and D1. Examination of the markers plot indicates that this may represent a differation process as there is a inverse correlation between NEUROG3 and INS in these groups. Further connections can be viewed in the plot.

#run pySwarm
result3 <- spSwarm(cObj, uObj, maxiter=10, swarmsize=250, cores=8)



#plot results
spPlot(result3)

Figure 11. Network plot of experimental data multuplet connections.

Conclusions

Functions

syntheticDataTest

syntheticDataTest <-
function (outPath = "data", cores = 5, n = 15, ngenes = 2000, 
    ncells = 100, cellTypes = 10, ...) 
{
    tmp <- .syntheticTestData(n, ngenes, ncells, cellTypes)
    syntheticData <- tmp[[1]]
    syntheticDataTest <- tmp[[2]]
    syntheticDataUnsupervised <- tmp[[3]]
    syntheticDataTable <- tmp[[4]]
    syntheticDataCounts <- spCounts(syntheticDataTest, matrix(), 
        sampleType = "m.")
    syntheticDataSwarm <- spSwarm(syntheticDataUnsupervised, 
        limit = "none", maxiter = 10, swarmsize = 250, cores = cores)
    save(syntheticData, file = paste(outPath, "syntheticData.rda", 
        sep = "/"), compress = "bzip2")
    save(syntheticDataCounts, file = paste(outPath, "syntheticDataCounts.rda", 
        sep = "/"), compress = "bzip2")
    save(syntheticDataUnsupervised, file = paste(outPath, "syntheticDataUnsupervised.rda", 
        sep = "/"), compress = "bzip2")
    save(syntheticDataSwarm, file = paste(outPath, "syntheticDataSwarm.rda", 
        sep = "/"), compress = "bzip2")
    save(syntheticDataTable, file = paste(outPath, "syntheticDataTable.rda", 
        sep = "/"), compress = "bzip2")
}

.syntheticTestData

(wraps .syntheticSinglets and .syntheticMultuplits into one function)

.syntheticTestData <-
function (n, ngenes, ncells, cellTypes, perplexity = 10) 
{
    tmp <- .syntheticMultuplets(ngenes, ncells, cellTypes, perplexity)
    singlets <- tmp[[1]]
    multuplets <- tmp[[2]]
    uObj <- tmp[[3]]
    testMultuplets <- multuplets[, sample(1:ncol(multuplets), 
        n, replace = FALSE)]
    table <- calculateConnections(testMultuplets, type = "multuplets")
    names <- c(paste("s", colnames(singlets), sep = "."), paste("m", 
        colnames(multuplets), sep = "."))
    testNames <- c(paste("s", colnames(singlets), sep = "."), 
        paste("m", colnames(testMultuplets), sep = "."))
    testSyntheticData <- cbind(singlets, testMultuplets)
    syntheticData <- cbind(singlets, multuplets)
    colnames(testSyntheticData) <- testNames
    colnames(syntheticData) <- names
    testSyntheticData <- as.matrix(testSyntheticData)
    syntheticData <- as.matrix(syntheticData)
    counts(uObj) <- testSyntheticData
    counts.log(uObj) <- sp.scRNAseq:::.norm.log.counts(testSyntheticData)
    sampleType(uObj) <- c(getData(uObj, "sampleType"), rep("Multuplet", 
        n))
    return(list(syntheticData, testSyntheticData, uObj, table))
}

.syntheticSinglets

.syntheticSinglets <-
function (ngenes, ncells, cellTypes) 
{
    for (i in 1:cellTypes) {
        set.seed(i)
        meanExprs <- 2^runif(ngenes, 0, 5)
        counts <- matrix(rnbinom(ngenes * ncells, mu = meanExprs, 
            size = i), nrow = ngenes)
        if (i == 1) {
            singlets <- counts
        }
        else {
            singlets <- cbind(singlets, counts)
        }
    }
    colnames(singlets) <- paste(sort(rep(letters, ncells))[1:(cellTypes * 
        ncells)], 1:ncells, sep = "")
    singlets <- as.data.frame(singlets)
    return(singlets)
}

.syntheticMultuplets

.syntheticMultuplets <-
function (ngenes, ncells, cellTypes, perplexity) 
{
    singlets <- .syntheticSinglets(ngenes, ncells, cellTypes)
    cObj <- spCounts(as.matrix(singlets), counts.ercc = matrix(), 
        sampleType = "[A-Z]")
    uObj <- spUnsupervised(cObj, max = ngenes, max_iter = 1000, 
        perplexity = perplexity)
    colnames(singlets) <- getData(uObj, "classification")
    mean <- getData(uObj, "groupMeans")
    combos <- t(matrix(rep(unique(colnames(singlets)), 2), ncol = 2))
    for (y in 1:ncol(combos)) {
        current <- combos[, y]
        new <- data.frame(rowMeans(data.frame(mean[, colnames(mean) %in% 
            current[1]], mean[, colnames(mean) %in% current[2]])))
        if (y == 1) {
            multuplets <- new
            names <- paste(current, collapse = "")
        }
        else {
            multuplets <- cbind(multuplets, new)
            names <- c(names, paste(current, collapse = ""))
        }
    }
    combos <- combn(unique(colnames(singlets)), 2)
    for (y in 1:ncol(combos)) {
        current <- combos[, y]
        new <- data.frame(rowMeans(mean[, colnames(mean) %in% 
            current]))
        multuplets <- cbind(multuplets, new)
        names <- c(names, paste(current, collapse = ""))
    }
    combos <- combn(unique(colnames(singlets)), 3)
    for (u in 1:ncol(combos)) {
        current <- combos[, u]
        new <- data.frame(rowMeans(mean[, colnames(mean) %in% 
            current]))
        multuplets <- cbind(multuplets, new)
        names <- c(names, paste(current, collapse = ""))
    }
    combos <- combn(unique(colnames(singlets)), 4)
    for (u in 1:ncol(combos)) {
        current <- combos[, u]
        new <- data.frame(rowMeans(mean[, colnames(mean) %in% 
            current]))
        multuplets <- cbind(multuplets, new)
        names <- c(names, paste(current, collapse = ""))
    }
    colnames(multuplets) <- names
    return(list(singlets, multuplets, uObj))
}

expressionDataTest

expressionDataTest <-
function (outPath = "data", cores = 5, n = 15, ...) 
{
    tmp <- .expressionTestData(n)
    expressionTestData <- tmp[[1]]
    testExpTestData <- tmp[[2]]
    expressionTestUnsupervised <- tmp[[3]]
    expressionTestTable <- tmp[[4]]
    expressionTestDataCounts <- spCounts(testExpTestData, matrix(), 
        "m.")
    expressionTestSwarm <- spSwarm(expressionTestUnsupervised, 
        limit = "none", maxiter = 10, swarmsize = 250, cores = cores)
    save(expressionTestData, file = paste(outPath, "expressionTestData.rda", 
        sep = "/"), compress = "bzip2")
    save(expressionTestDataCounts, file = paste(outPath, "expressionTestDataCounts.rda", 
        sep = "/"), compress = "bzip2")
    save(expressionTestUnsupervised, file = paste(outPath, "expressionTestUnsupervised.rda", 
        sep = "/"), compress = "bzip2")
    save(expressionTestSwarm, file = paste(outPath, "expressionTestSwarm.rda", 
        sep = "/"), compress = "bzip2")
    save(expressionTestTable, file = paste(outPath, "expressionTestTable.rda", 
        sep = "/"), compress = "bzip2")
}

.expressionTestData

.expressionTestData <-
function (n) 
{
    counts <- getData(expData, "counts")
    sampleType <- getData(expData, "sampleType")
    sng <- counts[, sampleType == "Singlet"]
    unsupervised <- spUnsupervised(expData, max = 2500, max_iter = 10^5)
    classification <- getData(unsupervised, "classification")
    means <- getData(unsupervised, "groupMeans")
    dataset <- data.frame(row.names = 1:nrow(sng))
    names <- c()
    tmp <- .assemble(means, classification, dataset, names, 2, 
        type = "homo")
    dataset <- tmp[[1]]
    names <- tmp[[2]]
    tmp <- .assemble(means, classification, dataset, names, 2, 
        type = "hetero")
    dataset <- tmp[[1]]
    names <- tmp[[2]]
    tmp <- .assemble(means, classification, dataset, names, 3, 
        type = "hetero")
    dataset <- tmp[[1]]
    names <- tmp[[2]]
    tmp <- .assemble(means, classification, dataset, names, 4, 
        type = "hetero")
    dataset <- tmp[[1]]
    names <- tmp[[2]]
    colnames(dataset) <- paste("m.", names, sep = "")
    colnames(sng) <- paste("s.", classification, sep = "")
    testMultuplets <- dataset[, sample(1:ncol(dataset), n, replace = FALSE)]
    table <- calculateConnections(testMultuplets, type = "multuplets")
    expTestData <- as.matrix(cbind(sng, dataset))
    testExpTestData <- as.matrix(cbind(sng, testMultuplets))
    counts(unsupervised) <- testExpTestData
    counts.log(unsupervised) <- sp.scRNAseq:::.norm.log.counts(testExpTestData)
    sampleType(unsupervised) <- c(rep("Singlet", ncol(sng)), 
        rep("Multuplet", n))
    return(list(expTestData, testExpTestData, unsupervised, table))
}

.assemble

.assemble <-
function (means, classification, dataset, names, x, type) 
{
    if (type == "hetero") {
        comb <- combn(unique(classification), x)
    }
    else {
        comb <- t(matrix(rep(unique(classification), 2), ncol = 2))
    }
    for (i in 1:ncol(comb)) {
        currentMult <- rowMeans(means[, comb[, i]])
        name <- paste(comb[, i], sep = "", collapse = "")
        dataset <- cbind(dataset, currentMult)
        names <- c(names, name)
    }
    return(list(dataset, names))
}

experimentalDataTest

experimentalDataTest <-
function (outPath = "data", cores = 5, ...) 
{
    data(expData)
    cObj <- expData
    experimentalDataUnsupervised <- spUnsupervised(cObj, max = 2000, 
        max_iter = 2 * 10^4)
    experimentalDataSwarm <- spSwarm(experimentalDataUnsupervised, 
        limit = "none", maxiter = 10, swarmsize = 250, cores = cores)
    save(experimentalDataUnsupervised, file = paste(outPath, 
        "experimentalDataUnsupervised.rda", sep = "/"), compress = "bzip2")
    save(experimentalDataSwarm, file = paste(outPath, "experimentalDataSwarm.rda", 
        sep = "/"), compress = "bzip2")
}